Exploring the power of data mining for uncovering traditional medicinal plant knowledge: A case study in Shahrbabak, Iran

The present study recorded indigenous knowledge of medicinal plants in Shahrbabak, Iran. We describe a method that uses data mining algorithms to predict the mode of application of medicinal plants. Twenty-one individuals aged 28 to 81 were interviewed. First, data were collected and analyzed using quantitative indices: the informant consensus factor (ICF), the cultural importance index (CI), and the relative frequency of citation (RFC). Second, the data were classified by support vector machines, J48 decision trees, neural networks, and logistic regression. In total, 141 medicinal plants from 43 botanical families were documented. Lamiaceae, with 18 species, was the dominant family, and leaves were the plant part most frequently used for medicinal purposes. Decoction was the most commonly used preparation method (56%), and therophytes were the dominant life form (48.93%). According to the RFC index, the most important species are Adiantum capillus-veneris L. and Plantago ovata Forssk., while Artemisia aucheri Boiss. ranked first on the CI index. The ICF index showed that informant consensus was highest for metabolic disorders in the Shahrbabak region. Finally, the J48 decision tree algorithm consistently outperformed the other methods, achieving 95% accuracy in both the 10-fold cross-validation and the 70–30 data split scenarios. The developed model predicts the mode of application of medicinal plants with high accuracy.


Introduction
The use of plants for traditional medicine and health purposes dates back to ancient times and is becoming increasingly popular in many parts of the world [1,2]. Medicinal plants are rich in active substances that are used to treat various diseases [3]. For novel drug development, the first and most critical stage is the collection and analysis of information on medicinal plants used by various indigenous cultures [4]. Ethnopharmacological studies are necessary to document the past and present state of cultural practices involving plants around the world. It is also essential to record indigenous people's knowledge of medicinal plants [1,5,6], which supports the preservation of traditional knowledge for future generations and other communities [7,8]. Throughout Iran, ethnopharmacological studies have been conducted on plants for many years [9–16]. With 900,000 ha of natural resources, the Shahrbabak region has many medicinal plants. However, the existing literature indicates a significant gap in our understanding of how local populations use these species for medicinal purposes and disease treatment.
Data mining emerged in the mid-1990s as a method for uncovering hidden knowledge. It can identify complexity, discover potential causal relationships, and find hidden relationships and correlations between variables [17].
Data mining is the science of extracting patterns and information from raw datasets produced by an organization, a society, or any other source. It transforms raw data into useful information. At a more detailed level, data mining is a step in the Knowledge Discovery in Databases (KDD) process. Generally, data mining can be divided into four main steps: determining goals, collecting and preparing data, extracting patterns, and evaluating the results [18]. Data mining algorithms are divided into four categories based on how they operate:
• Classification (supervised learning): In this type of learning, a set of samples with their labels is provided to the model, and the model must establish a relationship between the samples and their labels. The trained model can then label and separate new samples. Classification algorithms include decision trees, support vector machines, neural networks, and logistic regression.
• Clustering (unsupervised learning): In this case, the algorithm divides the data into groups based on their similarities. Unsupervised learning uses unlabeled samples. In these algorithms, a cost function and a distance measure are defined, and the algorithm should reduce the value of the cost function according to the distance measure.
• Semi-supervised learning: This approach involves both labeled and unlabeled data and lies between unsupervised and supervised learning.
• Reinforcement learning: The algorithm continuously discovers and learns by exchanging information and actions with the surrounding environment. When the machine receives a reward for performing specific actions, it learns how to improve itself to receive more rewards in the future [19].
We used different data mining algorithms for prediction. The most important classification algorithms used in this article are described below.
• Decision tree is a supervised learning (classification) method. In a decision tree, an internal node represents an attribute, a branch represents a decision rule, and each leaf node indicates an outcome. The topmost node of the tree is known as the root node. A decision tree is suitable for establishing non-linear relationships between features and classes, and it is flexible because it can easily model non-linear or unconventional relationships. It can capture interactions between predictors. This method is also easy to interpret due to its binary structure [19,20].
• Support Vector Machine (SVM) is a supervised learning algorithm that solves classification problems and is applied in many classification fields. SVM is designed so that the members of a class have the least distance from each other and the maximum distance from other classes. It can be used for linear and non-linear classification. SVM classifiers are based on the linear separation of the data, in which the separating line with the largest margin is selected [19,20].
• Logistic regression is a classification algorithm that assigns observed samples to a discrete set of classes. Unlike linear regression, which produces continuous numerical values, it uses the logistic sigmoid function to transform its output into a probability value that can be mapped to two or more discrete classes. Logistic regression works well when the relationship in the data is almost linear but poorly if non-linear relationships exist between the variables [19,20].
• An Artificial Neural Network (ANN) is an information processing paradigm inspired by the way biological neural systems, such as the brain, process information. An ANN consists of several layers of simple processing elements called neurons. Each neuron performs two functions: collecting inputs and producing an output. An overview of the theory, learning rules, and applications of the most important neural network models is given in [21].
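To illustrate how a decision tree selects the attribute for an internal node, the following minimal sketch computes the gain ratio, the splitting criterion of C4.5 (J48 is the name the Weka toolkit gives to its C4.5 implementation). The attribute names and the four toy records are hypothetical and are not taken from the study's dataset; pruning and numeric attributes are omitted.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(rows, attribute, target):
    """Information gain of `attribute` with respect to `target`, divided by split info."""
    labels = [row[target] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    n = len(rows)
    # Weighted entropy of the target after splitting on the attribute.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records (illustration only).
toy_rows = [
    {"preparation": "decoction", "part": "leaf", "mode": "oral"},
    {"preparation": "decoction", "part": "seed", "mode": "oral"},
    {"preparation": "liniment", "part": "leaf", "mode": "topical"},
    {"preparation": "liniment", "part": "seed", "mode": "topical"},
]

for attr in ("preparation", "part"):
    print(attr, gain_ratio(toy_rows, attr, "mode"))
```

In this toy example "preparation" separates the two classes perfectly while "part" carries no information, so a C4.5-style tree would place "preparation" at the root.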
Data quality is vital in data analysis because incorrect data leads to wrong results. Fast detection of data quality issues reduces the effort and time needed to find and analyze them. Therefore, it is necessary to use data mining methods to find defects and fix wrong data [22].
Data mining starts with raw data and continues until new knowledge is formed. Data cleansing refers to identifying, removing, and correcting wrong data in tables, records, or databases. It also includes identifying incomplete and incorrect parts of the data and correcting or replacing them. Data integration is the collection of data from multiple source systems to create single sets of information for operational and analytical applications. In the data selection step, the relevant dataset is selected and retrieved. Sometimes, to increase the accuracy of the analysis, the raw data must be changed before analysis; one of these changes is the data transformation process [23]. In addition, the right features need to be identified. Choosing the most important features improves the efficiency of data mining algorithms and the understanding of the data, reduces algorithm execution time, reduces data storage volume, and simplifies the model. Feature selection methods are divided into filter, wrapper, and embedded methods [24].
Very few studies have applied data mining methods to the discovery of hidden knowledge in ethnopharmacology. Data mining in ethnopharmacology has been found to have two important advantages. First, it uses both qualitative and quantitative data (such as observations and sensor information) to study phenomena that are practically inaccessible through either data type alone. Second, it provides a means of interpreting those data that produces novel insights by exposing the biases inherent in each data type alone [25]. Axiotis et al. performed a study using an intelligent search system to support ethnopharmacological research through a combination of active learning and reinforcement learning. They reported that machine learning-powered research improved the effectiveness and efficiency of the domain expert by 3.1 and 5.14 times, respectively. This was done by fetching 420 relevant ethnopharmacological documents in only seven hours versus an estimated 36 hours of human effort [26]. The current study documented ethnomedicinal knowledge of medicinal plants in the Shahrbabak region, in Kerman Province in the southeastern part of Iran, and analyzed the medicinal plants used for treating various diseases. In addition, for the first time, we propose modeling and comparing data mining algorithms to predict the mode of application of medicinal plants. Our study combines qualitative ethnobotanical fieldwork with advanced data mining approaches to create a systematic framework for collecting and analyzing traditional medicinal plant data. It also demonstrates the potential of data mining as a tool for unlocking valuable insights from traditional knowledge systems.

Study area
The city of Shahrbabak is situated (30˚11´63˝N 55˚11´86˝E) in the north-west of Kerman Province, with an area of 13,500 square kilometers and an altitude of 1845 m above sea level. According to the 2006 Iranian census, this area had 43,916 residents. Shahrbabak is an ancient Iranian city. Meymand, one of Iran's four ancient villages, is 36 kilometers from Shahrbabak. The town is near Sarcheshmeh and Miedook, Iran's largest copper mines. Historians say the town was built by the Sassanid king Ardeshir Babakan 1800 years ago. Shahrbabak has a semiarid climate with hot, dry summers and cold, dry winters. The region's mean annual temperature, rainfall, and humidity are 16.2˚C, 162 mm, and 34%, respectively.

Collecting data and identifying plants
Data were collected from different parts of the Shahrbabak district, in the north-west of Kerman Province. The interviewees were indigenous practitioners, sellers, shepherds, and medicinal herb vendors, who assisted with identifying plants they regarded as medicinal (the questionnaire is accessible through an online Supplementary file). Twenty-one individuals (11 females and 10 males) aged 28 to 81 were interviewed. Plants were collected from the Robat, Meymand, Khatoun Abad, Estabragh, Mehrabad, Dehej-Jowzam, Abdar, Barfe, and Khabr regions, all parts of the Shahrbabak district. Information on vernacular names, plant part(s) used as pharmacological agents, medicinal uses, and methods of treatment and preparation was recorded and is shown in Table 1. The plants were dried, labeled, and preserved in the Herbarium of the Biology Department at the University of Jiroft for identification and future work. Medicinal plants were identified using Flora Iranica [27], Flora Palaestina [28], Flora of Iraq [29], Flora of Turkey [30], and the Flora of Iran (in color) [31]. Plant life forms were classified according to Raunkiaer's system [32].

Data analysis
Ethnomedicine information was evaluated using plant medicinal reports. Three variables were used to define this indicator: i, u, and s. Accordingly, the informant 'i' mentions the use of species 's' in a specific category of use 'u'. The number of medicinal plants and the number of informants reporting the use of a species were counted. We also calculated quantitative value indices.

Informant consensus factor (ICF)
The Informant Consensus Factor (ICF) was applied to determine the homogeneity of information. The claims regarding medicinal uses are termed 'citations', which were classified into the ailment categories for which each plant was deemed effective. The ICF index was estimated as follows:
$$\mathrm{ICF} = \frac{N_{ur} - N_{t}}{N_{ur} - 1}$$
Here, $N_{ur}$ represents the number of use citations in each category, whereas $N_{t}$ represents the number of species used in that category [33].
The following formula was employed to compute the relative frequency of citation (RFC) index:
$$\mathrm{RFC} = \frac{FC}{N}$$
That is, the frequency of citation (FC), i.e. the number of interviewees who mentioned a species as beneficial, was divided by the total number of participants in the survey (N). The RFC index ranges from 0 (when no informant mentions the species as beneficial) to 1 (when all interviewees mention it as beneficial).
Using the following equation, the cultural importance index (CI) was estimated:
$$\mathrm{CI} = \sum_{u=u_1}^{u_{NC}} \sum_{i=i_1}^{i_N} \frac{UR_{ui}}{N}$$
Here, $UR_{ui}$ is the use report of informant $i$ for the species in use category $u$, $NC$ is the number of use categories, and $N$ is the total number of informants. The CI index therefore considers both the frequency of use of a species (the number of informants who cite it) and the number of use categories in which it is used.
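The three indices can be sketched in a few lines of Python. This is only an illustration of the formulas above: the function names and the toy counts are hypothetical and do not come from the study's tables.

```python
def icf(n_ur, n_t):
    """Informant consensus factor: (Nur - Nt) / (Nur - 1)."""
    if n_ur <= 1:
        raise ValueError("ICF is undefined when there is at most one use citation")
    return (n_ur - n_t) / (n_ur - 1)


def rfc(fc, n_informants):
    """Relative frequency of citation: FC / N."""
    return fc / n_informants


def ci(use_reports_by_category, n_informants):
    """Cultural importance index: sum of use reports over all use categories, divided by N.

    `use_reports_by_category` maps a use category to the number of informants
    who reported the species for that category.
    """
    return sum(use_reports_by_category.values()) / n_informants


# Hypothetical toy values (illustration only).
print(icf(n_ur=11, n_t=5))                              # consensus for one ailment category
print(rfc(fc=15, n_informants=21))                      # share of informants citing a species
print(ci({"digestive": 12, "respiratory": 8}, 21))      # use reports per informant
```

Because a single informant can contribute one use report per use category, CI can exceed 1, whereas RFC is bounded by 1.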
The relationship between an informant's age and the number of applications that informant reported for a given medicinal plant was assessed using the coefficient of determination (R²). For this purpose, each informant's age was scored according to the following ranges: 1 (for 28-40), 2 (for 40-50), 3 (for 50-60), 4 (for 60-70) and 5 (for 70-81).
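A minimal sketch of the age scoring and of R² for a simple linear regression is given below. Because the stated ranges share their boundary values (for example, 40 appears in two ranges), the sketch assumes that a boundary age falls into the lower class; this is an assumption, not something stated in the text. The function names are hypothetical.

```python
def age_score(age):
    """Map an informant's age to a score from 1 to 5.

    Assumption: boundary ages (40, 50, 60, 70) are assigned to the lower class.
    """
    if age < 28 or age > 81:
        raise ValueError("age outside the range covered by the study (28-81)")
    for upper_bound, score in ((40, 1), (50, 2), (60, 3), (70, 4), (81, 5)):
        if age <= upper_bound:
            return score


def r_squared(xs, ys):
    """Coefficient of determination of a simple linear regression of ys on xs.

    For one predictor this equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)
```

For example, `r_squared([age_score(a) for a in ages], reported_uses)` would give the R² between age class and number of reported uses for hypothetical lists `ages` and `reported_uses`.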

Methodology proposed
The proposed method involves three steps. Preprocessing the data is the first step. This step eliminates unnecessary features and simplifies the dataset. By removing inconsistent features, predictions can be improved and execution times reduced. Data mining algorithms are sensitive to differences in uppercase and lowercase letters, spacing, and similar formatting, so the data are first made consistent. One sample had an empty value in the "mode of application" column, which is the column to be predicted, so this sample was removed from the dataset. In addition, columns that play no role in the prediction process, such as "Scientific name", "Vernacular name (Persian)", and "Voucher no", were removed from the dataset.
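The preprocessing step can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the three dropped column names and the target column name come from the text, while the remaining column names ("Part used", "Preparation") and the sample rows are hypothetical.

```python
DROP_COLUMNS = ("Scientific name", "Vernacular name (Persian)", "Voucher no")
TARGET = "mode of application"


def preprocess(rows, target=TARGET, drop_columns=DROP_COLUMNS):
    """Normalize text values, drop non-predictive columns, and skip rows without a label."""
    cleaned = []
    for row in rows:
        label = row.get(target)
        if label is None or not str(label).strip():
            continue  # the target is empty, so the sample cannot be used
        new_row = {}
        for column, value in row.items():
            if column in drop_columns:
                continue
            # Collapse repeated whitespace and lowercase, so "Oral " equals "oral".
            new_row[column] = " ".join(str(value).split()).lower()
        cleaned.append(new_row)
    return cleaned


# Hypothetical sample rows (illustration only).
raw_rows = [
    {"Scientific name": "Species A", "Voucher no": "0001",
     "Part used": " Leaf ", "Preparation": "Decoction", "mode of application": "Oral"},
    {"Scientific name": "Species B", "Voucher no": "0002",
     "Part used": "seed", "Preparation": "infusion", "mode of application": "  "},
]

print(preprocess(raw_rows))
```

Only the first row survives, with its identifying columns removed and its values normalized.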
In the second step, the preprocessed samples are divided into training and testing sets. In the current study, 70% of the data is used for training and 30% for testing. The 10-fold cross-validation method was also used to train and evaluate the proposed model.
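A 70-30 hold-out split can be sketched with the standard library alone. The function name, the fixed seed, and the use of 140 hypothetical sample identifiers are assumptions for illustration; the text does not state whether the split was shuffled or stratified, and this sketch shuffles without stratification.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples reproducibly and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample identifiers (illustration only).
train, test = train_test_split(range(140))
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, which makes accuracy figures from different algorithms comparable on the same test samples.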
Once the data has been preprocessed and converted into an optimal dataset, it can be fed into different classification algorithms (supervised learning). To classify the dataset, support vector machines, J48 decision trees, neural networks, and logistic regression were employed. All data mining algorithms were tested on the dataset with different parameters.
In the third step, the proposed model is evaluated using various model evaluation criteria, including F-measure, Recall, Precision, Receiver Operating Characteristic (ROC), and Cohen's Kappa.

Results and discussion
The Shahrbabak region was chosen for this ethnopharmacological study for several reasons. First, its historical significance as an ancient Iranian city, believed to have been established by the Sassanid king Ardeshir Babakan around 1800 years ago, offers a rich cultural and historical context for ethnopharmacological research [34]. Additionally, the town lies in a semi-arid climate and close to significant copper mines, both of which may influence local ethnopharmacological practices. This study should benefit researchers, scientists, herbal enthusiasts, and pharmaceutical professionals. Researchers can use data mining techniques to analyze and interpret traditional medicinal plant knowledge. The current study also provides a valuable resource for pharmaceutical researchers, allowing them to explore and develop novel drugs in the future. This contributes to the advancement of pharmaceutical science and the improvement of healthcare solutions.

Plant diversity
All 141 plant species in this study were considered medicinal by indigenous people. These medicinal plants belong to 43 different families. Lamiaceae (18 species), Fabaceae (17 species), and Apiaceae (16 species) were the most frequently occurring families among the 141 species in the area, followed by Asteraceae with 13 species (Fig 2). According to a previous report from Iran, the Lamiaceae and Apiaceae families had the highest number of medicinal plants in the area studied [16]. According to other studies, Lamiaceae is the most abundant family of medicinal plants in Kerman province [15]. Furthermore, the Lamiaceae family includes plants with medicinal properties that enable their use as sources of traditional drugs. These can be applied to digestive disorders, menstrual disorders, hepatitis, and liver diseases [35,36].

Plant parts used as medicinal agents
Local people reported using different plant parts. The most common parts used were leaf (17.7%), seed (17.1%), aerial parts (16.6%), fruit (15.1%), and flower (11.9%), in that order (Fig 3). In contrast, indigenous people were least likely to use whole plants, corms, gums, skin, capsules, and rhizomes. The leaf was the most popular part, which could be explained botanically by the fact that leaves are the site of photosynthesis and contain compounds such as chlorophyll, flavonoids, alkaloids, and other bioactive molecules [37,38]. These results support previous reports that leaves, fruits, and aerial parts are the parts mainly used medicinally [12,16].

Preparation and modes of application
Decoction was found to be the most frequently used method (56%) for preparing plant materials before medicinal application. Other preparation methods were freshly cooked (17%), infusion (14%), and liniment (12%) (Fig 4). Due to its ease of use, decoction is usually the most widely used method for preparing medicinal plants before consumption [39]. The modes of application are oral, topical, and combined. In the available literature, most plants are reported to be consumed orally, while the topical mode of application is secondary to oral consumption (Table 1). For instance, Centaurium pulchellum subsp. grandiflorum (Batt.) Maire, Geranium rotundifolium L., and Berberis jamesiana Forrest & W.W.Sm. are used only orally, whereas Ducrosia assadii Alava and Cymbopogon schoenanthus (L.) Spreng. are used topically, and some species, such as Sanguisorba minor Scop. and Papaver dubium L., are used both orally and topically. The variety of methods employed for plant preparation and application reflects the diversity and complexity of traditional medicinal practices. Understanding the various modes of application provides insight into traditional knowledge of medicinal plant use and highlights the potential and adaptability of these natural resources for addressing healthcare needs [40].

The life cycle of plants
An analysis of the life cycles shows the dominance of Therophytes (48.93%) and Geophytes (21.98%) among the species recorded in the current study (Fig 5). The presence of Geophytes and Therophytes in a region can provide valuable information about the availability and seasonal variation of the medicinal plants used by local communities [41]. Because Therophytes can grow in adverse conditions and germinate quickly after rain, medicinal plant resources may be abundant during certain seasons [42]. Additionally, Geophytes can store vital nutrients underground, ensuring a continuous supply of medicinal plants, especially in arid climates [43].

Records and categories
Based on the data collected from the informants, a total of 222 medicinal applications are reported in this work. These can be categorized into 14 groups, of which disorders of the digestive system (27.92%) are the most common ailments treated by plants, followed by metabolic disorders (14.41%), cold-flu and fever (10.81%), and problems of the nervous system (7.2%) (Fig 6). These results are similar to other studies in which many medicinal plants were used to alleviate digestive disorders [9,10,12,15,16,44]. Within these categories, ailments such as constipation, diarrhea, and influenza have been commonly treated in the Shahrbabak region.

Comparison of different indices
Table 2 presents the ICF values for the categorized ailments. Metabolic (0.64) and musculoskeletal disorders (0.62) had the highest ICF values and included ailments such as kidney stones, urinary infections, diuretic problems, rheumatism, headaches, and skeletal fractures. The respiratory system (0.5) also had a high ICF value, followed by skin and hair (0.43), digestive system (0.4), nervous system (0.4), cold/flu/fever (0.4), and cuts/wounds (0.4). For liver problems and flavor/appetizing, the ICF values were 0.37 and 0.36, respectively. A very low ICF value indicates that informants do not exchange much information about the use of species to treat diseases [11]. The digestive system was the category treated with the most plants (20 plants), followed by cold/flu/fever (15 plants), sedative ailments (11 plants), the nervous system (10 plants), cuts/wounds (10 plants), and skin and hair (9 plants). According to the ICF values, metabolic and musculoskeletal disorders are the ailments on which informants in the Shahrbabak region agree most. These findings are consistent with other research in which metabolic disorders had the highest ICF, in Sirjan, a city in Kerman province [15], and in Rasuwa District in Central Nepal [45]. Nonetheless, these results differ from some other published reports from Iran, such as those carried out in the south of Kerman [16] and in Kohgiluyeh and Boyer Ahmad province [46]. Adiantum capillus-veneris L. and Plantago ovata are the most valued plants in this region, as many informants confirmed their usefulness (Table 3). The number of informants reporting a specific use for a plant species is called a 'Use Report' (UR). Artemisia aucheri had the maximum number of reports confirming its medicinal use (26 UR), followed by Centaurium pulchellum (22 UR), Salix mucronata Thunb. (21 UR), and Diarthron lessertii (Wikstr.) Kit Tan and Plantago ovata (20 UR) (Fig 7). Sadat-Hosseini et al. reported that Chrysanthemum parthenium (L.) Pers. and Cerasus mahaleb (L.) Mill. had the highest number of use reports (23) in the south of Kerman [16]. Nasab and Khosravi studied the Sirjan region in Kerman and found that Malva sylvestris L. had the highest number of medicinal use reports [15].
Table 3 provides the results obtained from the RFC and CI indices. The most important species according to the RFC index are A. capillus-veneris, P. ovata, Malva parviflora var. parviflora, and Genista tinctoria L. It can therefore be suggested that these species are commonly recognized by many informants in the Shahrbabak region. However, A. aucheri ranked first on the CI index; Table 3 shows the ranking based on both the CI and RFC indices. These results differ from some published studies. For instance, Sadat-Hosseini et al. reported that C. mahaleb and C. parthenium ranked first in the south of Kerman [16], while Mosaddegh et al. indicated that Teucrium polium L. ranked first in Kohgiluyeh and Boyer Ahmad province [12]. The linear regression between informants' age and the number of reported uses for a given medicinal plant is significant (P-value = 0.004; Fig 8). This indicates that older informants have more knowledge of the use of medicinal plants. The RFC and CI indices show that the best-known plants contain major chemical compounds (Table 4) such as 1,8-cineole and α-pinene.

Medicinal plants used in combinations
In some cases, indigenous people treated diseases using a combination of medicinal plants. For example, a combination of Foeniculum vulgare Mill., Elwendia persica (Boiss.) Pimenov & Kljuykov, and Cuminum cyminum L. was used as a carminative and to relieve gastric discomfort. Also, the combination of Tanacetum parthenium (L.), Ocimum basilicum L., and Nepeta glomerulosa Boiss. was reported to be effective as a nerve tonic. Traditional medicine uses combinations of medicinal plants to enhance therapeutic effects and minimize side effects [47]. With this approach, new treatment strategies can be developed and local plants can be identified for drug development [48].

Side effects of medicinal plants
Informants believe that combining F. vulgare, E. persica (syn. Bunium persicum), and C. cyminum can improve digestion. However, this combination can cause abortion in some women. Ferula species may also cause diarrhea in children and adults. The side effects of medicinal plants are influenced by an individual's reaction, the dosage, the preparation method, and interactions with other medications or health conditions [49,50]. Esmaeilzadeh reported that herbal combinations can benefit certain ailments but also present risks, such as a risk of abortion for women [49].

Comparison of plants identified in the current study with previous studies
This study was compared with 14 similar studies (in Iran and other countries) to identify the plants reported as medicinal for the first time in the available literature. The previous studies were carried out in various regions of Iran, including Sirjan [15], the south of Kerman [16], Kohgiluyeh and Boyer Ahmad [12], Saravan [11], Turkmen Sahra [9], and West Azarbaijan [10]. Other countries in which studies have been conducted include Pakistan [51,52], Sri Lanka [53], Brazil [54], China [55], Morocco [56], Italy [57], and India [58]. Table 5 presents the results of comparing the medicinal plants with other reports. Following the literature review, 57 of the 141 species are reported here for the first time to have medicinal uses. The comparison also showed that some plants have a wider distribution range but different uses. For example, in various studies F. vulgare is reported to be used for ailments such as abdominal pain and bloating [15], gastric discomfort, bone and joint pain [16], and diuretic problems and kidney malfunctions [12]. It could also be used to treat menstrual disorders or as a lactiferous agent, to alleviate coughs, asthma, and digestive disorders, and as a nerve tonic [11]. It is a carminative and hypnotic agent [9] and could treat hypertension [51], diabetes [56], and stomachache [57].

Criteria for evaluation
F-measure, Recall, Precision, Receiver Operating Characteristic (ROC), and Cohen's Kappa are used in this study to evaluate the performance of the proposed method [59]. An important evaluation criterion in data mining is accuracy. Several studies have explored the use of different assessment metrics in predictive modeling of medicinal plant uses. While accuracy is a vital criterion, it is important to consider other metrics such as precision, recall, F-measure, Cohen's Kappa coefficient, and ROC analysis [60][61][62]. These metrics can provide a more comprehensive understanding of model performance, particularly in multi-class classification problems. However, the choice of evaluation metric should be tailored to the specific objectives of the model, with accuracy being less suitable for particular applications [62]. Each classified sample falls into one of four outcomes:
• TP: The algorithm classified the sample in the positive category, and the sample is positive.
• FP: The algorithm classified the sample in the positive category, but the sample is negative.
• TN: The algorithm classified the sample in the negative category, and the sample is negative.
• FN: The algorithm classified the sample in the negative category, but the sample is positive.
In other words, when the algorithm mispredicts the sample class, the result is FN or FP. When the algorithm correctly predicts the sample class, the result is TN or TP.
A model's accuracy is determined by its ability to detect the medicinal plant's mode of application correctly. It equals the number of samples recognized correctly divided by the total number of available samples. A model with a higher accuracy value is more accurate and reliable. Eq (1) shows the accuracy evaluation criterion.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

Precision
This evaluation criterion measures how many of the samples that the proposed method predicts as positive are actually positive. The precision criterion is appropriate when the cost of false positives (FP) is high. The precision criterion is given in Eq (2).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall
The recall criterion measures how many of the actually positive samples the model detects. It is appropriate to use the Recall criterion when the cost of false negatives (FN) is high. The Recall criterion is shown in Eq (3).
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

F-measure
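A critical evaluation criterion for model accuracy is the F-measure. The two measures of Recall and Precision are combined to form this criterion. Eq (4) shows the F-measure criterion.
$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$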

Cohen's kappa coefficient
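Cohen's kappa coefficient is a numerical measure between -1 and +1. Values closer to +1 indicate strong agreement between the predicted and actual classes, while values closer to -1 indicate disagreement. Cohen's kappa coefficient is given in Eq (5).
$$\kappa = \frac{p_o - p_e}{1 - p_e} \tag{5}$$
Here, $p_o$ is the observed agreement (the proportion of correctly classified samples) and $p_e$ is the agreement expected by chance.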
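The criteria in Eqs (1)-(5) can be computed directly from the four confusion-matrix counts. The sketch below covers the binary case with hypothetical counts; the function name is an assumption, and it presumes that no denominator is zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F-measure and Cohen's kappa from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Chance agreement: product of predicted and actual class proportions, summed over classes.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Hypothetical counts (illustration only).
print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

For a multi-class target such as the mode of application, the same quantities are usually computed per class (one class against the rest) and then averaged.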
Receiver Operating Characteristic (ROC) analysis is one of the most important evaluation criteria for supervised learning models and is summarized by the area under the curve (AUC). A ROC curve is created by plotting the True Positive Rate against the False Positive Rate. Because the classification threshold is variable, a continuous curve results.
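As a hedged illustration of how a ROC curve and its AUC arise from a varying threshold, the sketch below sweeps the threshold over classifier scores and integrates the curve with the trapezoidal rule. The scores and labels are hypothetical, labels are coded as 1 (positive) and 0 (negative), and both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points for decreasing thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores and labels (illustration only).
print(auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])))
```

An AUC of 1 means every positive sample is scored above every negative one, while 0.5 corresponds to random ranking.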

10-fold cross-validation
The K-fold cross-validation method is used to assess the model's performance. The 10-fold cross-validation method divides the original sample into ten equal parts. In each iteration, nine parts are used as training data and one part as test data, until every part has served as test data once. The presented model was therefore trained and tested ten times, and the reported result is the average accuracy over the ten runs. The benefit of this approach is that it mitigates the overfitting risks linked to a single random split [59].
The results show that in the 70-30 split, the J48 decision tree algorithm correctly predicted the dataset samples with an accuracy of 95.24%. In the 10-fold cross-validation, the J48 decision tree algorithm correctly assigned new samples to their respective classes with 95% accuracy. Because 10-fold cross-validation averages over ten runs and the number of records in the dataset is small, we used the 10-fold cross-validation method for prediction. Based on Fig 10, there is little difference between the 10-fold cross-validation and the 70-30 split, and since the J48 decision tree algorithm achieved 95% accuracy with cross-validation, this model was selected. The confusion matrix was used to calculate the model's values for the different evaluation metrics. Table 7 shows that the J48 decision tree algorithm achieved high values on each evaluation metric, indicating that it is a very accurate algorithm. It is also more efficient than the other algorithms.
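The fold construction described above can be sketched as follows. The function names and the `evaluate` callback are hypothetical; the sketch neither shuffles nor stratifies the folds, which a real experiment would typically do, and it uses 140 sample indices purely as an example.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(n_samples, evaluate, k=10):
    """Average the accuracy returned by `evaluate(train, test)` over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# With 140 samples, each of the ten folds holds 14 test samples and 126 training samples.
for train, test in k_fold_indices(140):
    print(len(train), len(test))
```

Here `evaluate` stands for any routine that trains a classifier on the training indices and returns its accuracy on the test indices.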

Conclusion
Using data mining analysis, we gained valuable knowledge about medicinal plant uses. We noticed clear preferences for specific plant families, including Lamiaceae, Fabaceae, and Apiaceae, and a strong tendency to use leaves in medicinal preparations. Based on the current study results, the following conclusions and suggestions are presented:
• Focusing on documenting, standardizing, and preserving traditional knowledge and quality is crucial. Herbal combinations must be carefully evaluated for potential side effects, with attention to dosage regulation and individual responses.
• For better reproducibility and understanding of ethnopharmacological studies, it should be considered that differences in language dialects and cultural interpretations could have influenced the data collection process and introduced complexities in data interpretation. Additionally, the reliability of data obtained from local informants remains a concern.
• Predictive modeling based on machine learning algorithms showed promise for predicting plant applications. However, future work will be challenged by limited data availability, model generalization across diverse regions, and the conservation and utilization of indigenous knowledge.
• Future studies could investigate the compounds and biochemical properties of these medicinal plants using data mining algorithms. This scientific investigation could help identify their antibacterial, antifungal, antitoxic, or neutral properties.
While data mining provides valuable insight into medicinal plant usage, future studies should focus on standardization, ethical considerations, and robust model development.
Fig 1 illustrates the flowchart of the proposed method.

Fig 9. The amount of dispersion of samples of each class based on the class of "mode of application". https://doi.org/10.1371/journal.pone.0303229.g009
Fig 10 shows the test accuracy of different classification algorithms under the 70-30 split and 10-fold cross-validation. In the 70-30 split, 70% of the data was used for training the proposed model and 30% for testing it.

Table 3. Comparison of important medicinal plants by using indices and species ranking based on each index.
RFC, relative frequency of citation; CI, index of cultural importance. https://doi.org/10.1371/journal.pone.0303229.t003

Table 7. Comparison of evaluation criteria of the proposed model. Columns: evaluation criterion, J48 decision tree, support vector machine, neural network, logistic regression.
https://doi.org/10.1371/journal.pone.0303229.t007